Speed up parsejson 3.25x (with a gcc-10.2 PGO build) on number heavy input #16055
Conversation
Also add a couple convenience iterators and update `changelog.md`.

The basic idea here is to process the ASCII just once in a combined validation/classification/parse rather than validating/classifying a number in `parsejson.parseNumber`, then doing the same work again in `parseFloat`/`parseInt`, and finally doing it a 3rd time in the C library `strtod`/etc. The last is especially slow on some libcs/common CPUs due to `long double`/80-bit extended-precision arithmetic on each digit. In any event, the new fully functional `parseNumber` is not actually much more code than the old classifying-only `parseNumber`.

Because we aim to do this work just once and the output of the work is an int64|float, this PR adds those exported fields to `JsonParser`. (It also documents two existing but undocumented fields.)

One simple optimization done here is pow10. It uses a small lookup table for the power-of-10 multiplier. The full 2048*8B = 16 KiB table is bigger than is useful. In truth, most numbers in any given run will likely be of similar orders of magnitude, meaning the cache cost would not be heavy, but it is probably best not to rely on likelihood alone. So we fall back to a fast integer-exponentiation algorithm when the exponent is out of the table's range.

The new `parseNumber` itself is more or less a straight-line parse of scientific notation where the '.' can be anywhere in the number. To do the right power-of-10 scaling later, we need to bookkeep a little more than the old `parseNumber`, as noted in side comments in the code.

Note that calling code that always wants a `float` even if the textual representation looks like an integer can do something like `if p.tok == tkInt: float(p.i) else: p.f` or similar, depending on context.
Note that since we form the final float as integer * powerOf10, since powers of 10 have some intrinsic binary representational error, and since some alternative approaches (on just some CPUs) use 80-bit floats, it is possible for the number to mismatch those other approaches in the least significant mantissa bit (or for the ASCII->binary->ASCII round trip to not be the identity). On those CPUs only, better-matching results might be obtained from an `emit` using `long double` if desired (also with a `long double` table for powers of 10 and the powers-of-10 calculation). This strikes me as unlikely to be truly needed long-term, though.
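The one-pass idea described above can be sketched roughly as follows. This is a minimal, hypothetical `parseNum` helper operating on a plain string, not the PR's actual `parseNumber` (which works on the parser's buffer and handles more cases such as leading `+`, NaN/Inf, and error reporting):

```nim
import std/math  # for pow

# Minimal one-pass sketch (hypothetical helper, not the PR's real code):
# accumulate digits into an int64 while bookkeeping where the '.' was,
# then scale once by a net power of 10 at the end.
proc parseNum(s: string): tuple[isInt: bool, i: int64, f: float] =
  var k = 0
  var neg = false
  if k < s.len and s[k] == '-':
    neg = true
    inc k
  var mant: int64 = 0
  var fracDigits = 0              # digits seen after the '.'
  var sawDot = false
  while k < s.len and (s[k] in {'0'..'9'} or s[k] == '.'):
    if s[k] == '.':
      sawDot = true
    else:
      mant = mant * 10 + int64(ord(s[k]) - ord('0'))
      if sawDot: inc fracDigits
    inc k
  var e = 0                       # optional exponent part
  if k < s.len and s[k] in {'e', 'E'}:
    inc k
    var eneg = false
    if k < s.len and s[k] in {'+', '-'}:
      eneg = s[k] == '-'
      inc k
    while k < s.len and s[k] in {'0'..'9'}:
      e = e * 10 + ord(s[k]) - ord('0')
      inc k
    if eneg: e = -e
  if neg: mant = -mant
  if not sawDot and e == 0:
    result = (true, mant, 0.0)    # pure integer: a tkInt-style result
  else:                           # scale once by the net power of ten
    result = (false, 0'i64, float(mant) * pow(10.0, float(e - fracDigits)))
```

The key point is that classification (int vs. float), validation, and value construction all happen in the same left-to-right scan, so no text is re-read by `parseFloat`/`strtod` afterwards.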
I won't be able to track down what is going wrong in the tests until at least tomorrow, and maybe not until Monday, if anyone is in a rush and wants to help. I did test it on parsing a whole big file. Also, here is a link back to the whole, long forum thread that inspired this.
Also, instead of slice assignment we could probably …

… slice assignment. Within 1.03x of non-copy mode in a PGO build (1.10x if the caller must save the last valid string). One might think this argues for ditching `strFloats` & `strIntegers` entirely and just always copying. In some sense, it does. However, the non-copy mode is a useful common case here, as shown in the `jsonTokens(string)` example code. So, it is a bit of a judgement call whether supporting that last 10% of perf in a useful common case matters.
Does anyone know why I am getting "undeclared identifier: 'copyMem'" errors? That seems weird.

Never mind. I guess NimScript does not support `copyMem`.
(Someone should make `copyMem` just work everywhere, IMO.)
Also, extract mess into a template.
This is only half related to this PR, but in the …

I have a question. The very end of …

Never mind. I am reading the long discussion. { A lot of "never mind"s in this one. That'll teach me to code before coffee. ;-) }
```nim
      e = e shr 1
      base *= base
```

```nim
proc parseNumber(my: var JsonParser): TokKind {.inline.} =
```
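The `e = e shr 1` / `base *= base` lines in the diff above are the inner loop of exponentiation by squaring. The small-table-plus-fallback scheme described in the PR text can be sketched like this (names like `pow10Tab`/`ipow10` are hypothetical, not the PR's exact code):

```nim
# Sketch of the table + fallback scheme: a small table covers the common
# exponent range; out-of-range exponents fall back to binary exponentiation.
const pow10Tab = block:
  # 10^0 .. 10^22 happen to be exactly representable in float64.
  var t: array[23, float]
  var x = 1.0
  for i in 0 .. 22:
    t[i] = x
    x *= 10.0
  t

proc ipow10(e: int): float =
  ## 10.0^e for any int e: table lookup when possible, squaring otherwise.
  var e = e
  var inv = false
  if e < 0:
    inv = true
    e = -e
  if e < pow10Tab.len:
    result = pow10Tab[e]           # the cheap, common path
  else:
    result = 1.0
    var base = 10.0
    while e > 0:                   # exponentiation by squaring
      if (e and 1) == 1: result *= base
      e = e shr 1
      base *= base                 # square the base each round, as above
  if inv: result = 1.0 / result
```

The squaring fallback is O(log e) multiplies, so even the full IEEE-754 double exponent range costs only about 10 iterations.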
Isn't the proc kinda too big to be `{.inline.}`? 🤔
No. In the by far most common case it is just one line (two almost-always-perfectly-predicted compare-jumps and one array reference). It is only used once, for human mental abstraction. So, there is no duplicated generated code.

Also, people are too sparing of `{.inline.}` in my view. Function call overhead is like 12..15 cycles * 3..6-way superscalar =~ 36..90 dynamic instruction slots. So, if duplication is not a concern, that is the rough amount of work to compare against.
If you were really going to browbeat me over inline then you should be complaining about `parseNumber` being inline, not `pow10`. ;-) {EDIT: oh, wait, that's what you were doing! Sorry!} But many numbers to parse might be just single digits, which take only a few cycles. In fact, the more numbers you have, the more likely many are single digits. So, the function call overhead would then be quite large compared to the work done.
Another thing inline buys you here is putting `parseNumber`'s C code in the same translation unit as the parsing itself. So, the C compiler's metrics/heuristics for inlining can at least be applied. Otherwise you rely on LTO, which a lot of people do not enable. Given that this whole bit of work was to get really fast speed, it seems appropriate for that reason as well as the "many single digit" cases. If you do PGO builds, the compiler will actually try to measure whether inlining is good or not. According to the GCC guys, that is the main benefit of PGO. I am always getting 1.5x to 2.0x speed-ups from PGO with Nim, which suggests to me that our `{.inline.}` choices are not generally so great (or at least a few critical choices are not...).
Oh, and `parseNumber` is also only called in one place, in `getTok`. So, still no extra L1 I-cache-style resource hit, and I also added `{.inline.}` on `getTok` to get the meat of all this logic into the translation unit of whatever is using the json parser.
Also, Nim's `{.inline.}` just marks it in the generated C as `inline`. So, it's still up to the C/C++ compiler's code metrics what ultimately happens. Those surely do take into account the statically observable number of expansions/substitutions and their sizes. So, the many calls to `getTok` from `json` are not a real problem. But if anyone really wants me to remove them, I can. I always use (and recommend using) PGO when performance matters, myself.
This is the only remaining problem. Before merging we should make overflow detection more precise than just a digit count for both `tkInt` and `tkFloat`.
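For the `tkInt` side, one exact alternative to a digit-count heuristic is the classic `strtol`-style cutoff test: check before each accumulation step whether it could overflow. A sketch with a hypothetical `accumulate` helper (not the PR's implementation):

```nim
# Exact int64 overflow detection during digit accumulation: before each
# `v = v*10 + d`, reject if that step would exceed high(int64).
proc accumulate(digits: string): tuple[ok: bool, v: int64] =
  const cutoff = high(int64) div 10        # largest v safe to multiply by 10
  const cutlim = int(high(int64) mod 10)   # last-digit limit when v == cutoff
  var v: int64 = 0
  for c in digits:
    let d = ord(c) - ord('0')
    if v > cutoff or (v == cutoff and d > cutlim):
      return (false, 0'i64)                # the next step would overflow
    v = v * 10 + int64(d)
  result = (true, v)
```

This rejects exactly the strings above `high(int64)` instead of everything past a fixed digit count, so 19-digit values that still fit are accepted.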
@Araq WARNING: Not ready to merge, though it passes all checks. (See final commit comment.)
Question probably for @Araq and all those in the prior discussion of this: Integer overflow going from JSON to a Nim `int64` … The precision dimension seems tricky, though. I know that sometimes people format floats to 18..20 digits just to try to get better round trips through 80-bit FP formats. I don't know how frequently other prog.lang JSON libs do this, though. Does anyone know? I avoid JSON due to it being atrociously 30-200x slower than binary formats and consequently don't have a lot of experience with the JSON libs. If they never emit more than 16 digits then there is no problem. If JSON libs frequently/ever emit an extra digit or 3, then it seems a bit overboard to have 17..20 decimal digits alone launch things into arbitrary-precision float arithmetic on the Nim receiver. I guess someone could hook into a preprocessor that truncated/rounded extra digits, but blech.

Ultimately, if one doesn't really know how the ASCII was generated, then client code needs to provide guidance for this "auto promotion to JString/arb.prec float" idea. So, it seems like reason enough for yet another new parameter … So, it seemed like asking how people would like this handled was warranted. With the below idea a …

Also, it seems like one thing that never came up in that overflow discussion was the idea of storing the … This would mean client code expecting ints in the range 2^63..2^64-1 needs to check a …

Now, I know your first reaction is probably that converting 2^63 into -2^63 seems like a keeerazy footgun. However, if I understand correctly, before, an exception was just raised for this data. So, client code had to try/except/do something special anyway to not fail. This would just change what it needs to do to not fail. Another new bool flag could also control whether you get the old exception or the new cast behavior. Best of all worlds?

I will probably hold off making these overflow conditions more precise until Monday/until I hear back from various folks on these questions.
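The "converting 2^63 into -2^63" behavior discussed above is just Nim's `cast` semantics for same-size integer types, and the client-side recovery is lossless. A tiny self-contained demo (the `stored` variable stands in for a hypothetical parser field):

```nim
# Demo of the "store the uint64 bit pattern in an int64 field" idea:
# values in 2^63 .. 2^64-1 come out negative as int64, and client code
# recovers the original with a cast back.
let big = 9223372036854775808'u64     # 2^63, too large for int64
let stored = cast[int64](big)         # what the parser field would hold
doAssert stored == low(int64)         # 2^63's bit pattern reads as -2^63
doAssert cast[uint64](stored) == big  # client-side recovery is lossless
```

Since `cast` only reinterprets bits, the round trip is exact for every value in 0 .. 2^64-1; the only cost is that the client must know when to apply the unsigned view.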
Alternatively, I am also fine if you want to merge this & let the imprecise test stand momentarily while someone else works on said overflow checking and/or API issues.
The …

Also, if "ever storing" the …

Well, if I get no more feedback on the future direction here, I will just do something tomorrow. I'm a little surprised, as the prior discussion seemed pretty passionate.
lib/pure/parsejson.nim (Outdated)
```nim
  finally:
    p.close()

iterator jsonTokens*(path: string, rawStringLiterals = false,
```
These API additions are not good for two reasons:
- An iterator that always produces the same value is a foreign concept.
- It looks like they are unused. :-)
They're unused because the idea was to provide something like `json.parseJsonFragments` but in the raw-parsing `parsejson` context. I think they are also not normal in yielding a `ptr JsonParser`, but that actually made like a 1.5x performance difference with the new faster logic, because the `JsonParser` object is big enough that the alloc/copies showed up at the top of the profile. It mostly seemed the easiest way to really streamline the usage syntax, as in here. I'm open to alternative suggestions to streamline said syntax, or just removing them and leaving them to the user, though.
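The copy-avoidance point can be illustrated in miniature with a hypothetical large object (the types here are stand-ins, not parsejson's real API): an inline iterator that yields a `ptr` to one reused buffer never copies the payload, whereas yielding the object by value would.

```nim
# Illustration of yielding `ptr` to one big reused object instead of
# copying it on every yield.
type Big = object
  payload: array[256, int]   # stand-in for a large JsonParser-sized object
  n: int

iterator bigItems(count: int): ptr Big =
  var b: Big                 # single buffer reused across iterations
  for i in 0 ..< count:
    b.n = i                  # "parse" the next token into the same buffer
    yield addr b             # no 2 KiB copy per yielded item

var total = 0
for p in bigItems(5):
  total += p.n               # caller must use it before the next iteration
doAssert total == 10
```

The trade-off is exactly the unusual ergonomics mentioned above: the pointer is only valid until the next iteration, so callers who want to keep a value must copy it themselves.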
I was more concerned about your feedback on the `int64` but not `uint64` overflow ideas: going down the `options = set[enum]` route and adding a new `u: uint64` field for users to conditionally/unconditionally access when they maybe/definitely expect unsigned, and that sort of direction. That's much more to do/get right than, say, moving those iterators to example code/doing something else. So, if you hated the general direction I wanted to hold off on that work.
The return type could be `lent T` but I don't know if today's compiler accepts that (it really shouldn't without a spec addition...). For now please remove these iterators; it's easy enough to write an open/close/loop snippet.
I was more concerned about your feedback about int64 but not uint64 overflow ideas, going down the options = set[enum] route and adding a new u: uint64 field for users to conditionally/unconditionally access when they maybe/definitely expect unsigned, and that sort of direction.
I would use a single `int64` field and then accessor procs/templates that hide the `cast[uint64](x)`. And `options = set[enum]` is the better design, yes. I couldn't use it because of backwards-compatibility reasons. There is a different solution that deals with the compat problems, but it's annoying to do; maybe it's time to make `.since` a builtin that works better.
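The single-field-plus-accessor suggestion might look something like this (type and accessor names are hypothetical, chosen for illustration):

```nim
# One stored int64 field; the unsigned view is an accessor that hides the
# cast, per the suggestion above.
type NumTok = object
  i: int64                       # the single storage field

template u(t: NumTok): uint64 = cast[uint64](t.i)

# A value in 2^63 .. 2^64-1 stored via its bit pattern:
var t = NumTok(i: cast[int64](18446744073709551615'u64))  # 2^64 - 1
doAssert t.i == -1                          # signed view wraps around
doAssert t.u == 18446744073709551615'u64    # accessor recovers the value
```

This keeps the object layout unchanged while letting callers who expect unsigned values opt into the reinterpretation explicitly.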
Can this be disabled, though? In the case of packedjson there is no reason to parse the number; it's stored as a string.
Yes. We can add a path based on … At the moment, though, I am working on the trickier problem of detecting arbitrary-precision float overflow/underflow. :-) (The integer case is easy. glibc actually uses a full bignum library to parse floats in `strtod`.)
Do not merge this. I am working on a cleaner and more general approach in light of edge cases.
This uses @c-blake's code from nim-lang/Nim#16055 to avoid having to copy the string which we parse and instead directly parses a number into either an int or a string. Note that this implementation has some edge cases that break it. In particular a number that starts valid, but doesn't end as a valid number (which is technically ok, but will break our CSV parser, since our position in the memfile will *not* be on the next separator!).
Related issues: #12833 #3809 #12152 nim-lang/RFCs#188